Abstract

Neural Networks are non-linear static or dynamical systems 
that learn to solve problems from examples. Most of the 
learning algorithms require a lot of computing power and, 
therefore, could benefit from fast dedicated hardware. One
of the most common architectures used for this
special-purpose hardware is the Systolic Array [9]. The
design and implementation of different Neural Network
architectures in Systolic Arrays can be complex, however.
This paper shows how the Hopfield Neural Network can be
mapped onto a 2-D Systolic Array and presents an FPGA
implementation of the proposed 2-D Systolic Array.

1 Introduction 

Neural Networks (NNs) imply a requirement for massive 
fine-grained parallelism, with very high levels of 
interconnection and simple processing at each node. It is 
from this parallelism that they derive their power. 

The intrinsically parallel structure of NNs maps poorly
onto conventional Von Neumann computer architectures.
Therefore, these architectures are not suitable for 
implementing real-time NN systems. The mismatch 
between the parallelism required for NNs and the 
performance of sequential computer architectures is 
exacerbated as networks increase in size. As a consequence 
there is much research activity in the implementation area, 
investigating technologies and architectures that allow the 
parallelism of NNs to be mapped into hardware. 

A number of technologies have been proposed for the
implementation of neural net systems, including analogue
very large scale integration (VLSI), digital VLSI,
opto-electronics and optical technologies. Each of these has
some advantages and disadvantages.

Conventional digital VLSI is a mature and therefore reliable 
technology. NNs implemented using this technology are 
based on simple processor array architectures with either a 
processor-per-synapse [2] [5] or a processor-per-neuron 
organization [10] [11]. The attractions of using this 
technology to implement neural systems are that precision 
can be arbitrarily specified, it is possible to implement any
training/learning algorithm, and the effects of scaling up an
implementation are predictable. Digital implementations are
larger and slower than analogue implementations;
nevertheless, they represent the most reliable and lowest-risk
path from high-level simulation to hardware implementation
of a real-time NN.

In this paper a 2-D Systolic Array (SA) of dedicated
Processing Elements (PEs), also called Systolic Cells (SCs),
is presented as the heart of a Hopfield Neural Network
accelerator. The 2-D SA that we propose has a
processor-per-synapse architecture; each SC holds an
element of the synaptic weight matrix. The resulting SC
structure is simple, and the 2-D SA has a regular
architecture with identical SCs arranged in a
mesh-connected array. This regular architecture is suitable
for VLSI implementations. An FPGA implementation of the
resulting SC is presented. The major advantage of the
FPGA-based implementation is its dynamic
reconfigurability.

2 The Hopfield Network 

The Hopfield model [14] is a feedback network in which
each neuron has synaptic interconnections with all the other
neurons. There are no specialized input and output units in
this architecture; each neuron is at the same time input and
output, and what matters for the network is its state. The
Hopfield network is a dynamical system, and its dynamics
are characterized by the evolution of the neuron states in
time. The goal is to exploit these natural dynamics to
make the network work as an associative memory.
In the discrete form the Hopfield network dynamics can be
described as follows:

$$p_i(t) = \sum_{j=1}^{N} w_{ij} \cdot x_j(t-1) \qquad (1)$$

$$x_i(t) = \sigma\big(p_i(t) - \theta_i\big) \qquad (2)$$

where $w_{ij}$ represents a coefficient or synaptic weight
associated with the j-th input $x_j$ and the i-th neuron, and
$\theta_i$ denotes the threshold of the i-th neuron. The
weighted sum $p_i$ is called the potential and $\sigma$
represents the activation function. That is, the recall phase
of the Hopfield network consists of a recurrence of the
following form:

$$x^{[k]} = \sigma\big(W \cdot x^{[k-1]}\big); \quad k = 1, \ldots, N \qquad (3)$$

Convergence occurs when $x^{[N]} = x^{[N-1]}$. Therefore,
$x^{[N]}$ represents the response (output vector) of the
Hopfield network (characterized by the weight matrix W) to the
input vector $x^{[0]}$.

3 The 2-D Systolic Architecture with Fixed Weights

The proposed 2-D SA is a processor-per-synapse
architecture. For an N-component input vector x, the W
matrix will have $N^2$ weights, which correspond to the $N^2$
Systolic Cells (SCs) in the 2-D SA. The input vector
components are loaded into the SA from the north and cross
the array to the south. The partial sums (PSs) cross the SA
from west to east and the output vector components are
obtained on the east border of the SA. The moving data are:

• the x vector components, which cross the SA from
north to south

• the partial sums PSs, which cross the SA from west
to east.

Each SC(i,j) receives the $x_j$ component of the x vector at
time t, computes the local product $w_{ij} \cdot x_j$ and adds
this product to the partial sum $PS_i$ received from the west.
The newly calculated $PS_i$ is sent to the neighboring SC to
the east. We will call a recurrence step the time needed to
compute the W·x multiplication.

When an output vector component $y_i$ comes out on the east
border of the SA, the activation function $\sigma$ is applied.
The $\sigma$ function is not integrated inside the SA;
specialized units placed on the east border of the SA compute
it. The resulting vector y is the input vector for the new
recurrence and must be reinserted into the SA from the north.
Since this is not a local data transmission, this loop
destroys the homogeneity of the data transmissions and can
increase propagation delays, which will affect the global
performance of the SA. It is preferable to convert this loop
into local data transmissions to the nearest neighbor (SC).

4 An Architecture with Local Data Transmissions

This architecture is based on 3 rules:

r1: Systolic sequencing of the operations inside the SA.

r2: Each SC communicates only with the four nearest
neighboring SCs.

r3: At the end of a recurrence the output vector will be in
the same initial position in which the input vector was
at the beginning of the recurrence; that is, the output
(new input) vector is ready to start a new recurrence.

The basic idea refers to the initial position of the input 
vector, which will be placed on the main diagonal of the SA 
(figure 2). All the SCs will have identical internal structures 
(figure 1) and will communicate data with only the four 
nearest neighboring cells. 
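
As a software reference for what the array has to compute, the recall recurrence (3) of Section 2 can be sketched in a few lines. This is only a minimal sketch under stated assumptions: the paper does not fix the activation function, so a bipolar sign function is assumed for σ; thresholds are taken as zero, as in (3); all neurons are updated synchronously; and the 3x3 weight matrix is a hypothetical example built as an outer product of one pattern with a zeroed diagonal. The names `sign`, `recall_step` and `recall` are illustrative and do not come from the paper.

```python
from typing import List

Vector = List[int]
Matrix = List[List[int]]


def sign(p: int) -> int:
    """Assumed activation: bipolar sign, with sign(0) = +1."""
    return 1 if p >= 0 else -1


def recall_step(W: Matrix, x: Vector) -> Vector:
    """One recurrence step x[k] = sigma(W . x[k-1]), thresholds assumed zero."""
    return [sign(sum(w_ij * x_j for w_ij, x_j in zip(row, x))) for row in W]


def recall(W: Matrix, x0: Vector, max_steps: int) -> Vector:
    """Iterate until two successive state vectors are equal, or max_steps."""
    x = list(x0)
    for _ in range(max_steps):
        x_next = recall_step(W, x)
        if x_next == x:  # convergence test: x[k] == x[k-1]
            break
        x = x_next
    return x


# Hypothetical example: weights storing the pattern (1, -1, 1)
# as an outer product with a zeroed diagonal.
W = [[0, -1, 1],
     [-1, 0, -1],
     [1, -1, 0]]
noisy = [1, 1, 1]  # stored pattern with the middle component flipped
result = recall(W, noisy, max_steps=3)
```

With these assumed weights the recurrence settles on the stored pattern after one corrective step and one confirming step; the systolic array has to reproduce exactly this W·x product followed by σ in every recurrence.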
Figure 1: Logic diagram of the SC

5 The Systolic Algorithm 

The algorithm for computing an output vector (a recurrence 
step) can be divided into 3 successive phases: 

• gradual computing of the partial sums. 

• applying the activation function $\sigma$.

• positioning the output (new input) vector for a new 
recurrence. 

We will present a recurrence step for a 3x3 weight matrix. 
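
The schedule walked through in the stages of this section can be previewed in software. The sketch below is an illustration and not the authors' implementation: it assumes, as the stage equations suggest, that SC(i,j) fires at stage i + j - 1 and that row i is passed through σ one stage after SC(i,N) fires, giving 2N stages per recurrence step. A bipolar sign function is assumed for σ, and the names `sign` and `systolic_step` are hypothetical. Only the order of the computations is modelled; the physical movement of the x components inside the array is not.

```python
from typing import List


def sign(p: float) -> int:
    """Assumed activation: bipolar sign, with sign(0) = +1."""
    return 1 if p >= 0 else -1


def systolic_step(W: List[List[float]], x: List[float]) -> List[int]:
    """Simulate one recurrence step stage by stage on an N x N array."""
    n = len(x)
    # ps[i-1][j-1] is the partial sum held in SC(i, j) (1-based indices).
    ps = [[0.0] * n for _ in range(n)]
    y = [0] * n
    for stage in range(1, 2 * n + 1):          # 2N stages in total
        for i in range(1, n + 1):
            j = stage - i + 1                  # SC(i, j) fires when i + j - 1 == stage
            if 1 <= j <= n:
                # The partial sum from the west was produced one stage earlier.
                west = ps[i - 1][j - 2] if j > 1 else 0.0
                ps[i - 1][j - 1] = west + W[i - 1][j - 1] * x[j - 1]
            if stage == i + n:                 # row i has left the east border
                y[i - 1] = sign(ps[i - 1][n - 1])
    return y
```

For N = 3 this yields the six stages described below: one cell fires in stage 1, two in stage 2, three in stage 3, and the three activations are applied in stages 4, 5 and 6.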
5.1 First Stage

As already stated, the input vector x will be placed on the
main diagonal of the SA. Complying with rule r1, we will
compute the first product (figure 2). If we denote:

$PS_{i,j}$ - the partial sum in the SC(i,j),
$x_{i,r}$ - the i-th component of the input vector x during the
recurrence step r,

the SA will compute the first partial sum:

$$PS_{1,1} = w_{11} \cdot x_{1,1} \qquad (4)$$

Figure 2: First stage (recurrence step r)
5.2 Second Stage 

During the second stage (figure 3) the $w_{21}$ and $w_{12}$
cells will compute. The data will be transmitted inside the SA
so that they are applied to the mentioned SCs:

• the weights (W matrix components) are resident
data.

• the input vector components $x_{i,1}$ will be propagated
to the north and will be reflected successively on the
north border of the SA.

• the partial sum $PS_{1,1}$ will be transmitted towards
the east.

After these local data transmissions, the local products will
be calculated and accumulated to the partial sums
which enter the two SCs from the west:

$$PS_{1,2} = PS_{1,1} + w_{12} \cdot x_{2,1} \qquad (5)$$

$$PS_{2,1} = w_{21} \cdot x_{1,1} \qquad (6)$$

At the end of this stage, the partial sums $PS_{1,2}$ and
$PS_{2,1}$ are located in the $w_{12}$ and $w_{21}$ SCs.



5.3 Third Stage

The mobile data will be transmitted as in the previous
stage. The $w_{13}$, $w_{22}$ and $w_{31}$ cells will compute
(figure 4):

$$PS_{1,3} = PS_{1,2} + w_{13} \cdot x_{3,1} \qquad (7)$$

$$PS_{2,2} = PS_{2,1} + w_{22} \cdot x_{2,1} \qquad (8)$$

$$PS_{3,1} = w_{31} \cdot x_{1,1} \qquad (9)$$

At the end of the third stage, the first component of the
output vector (the first final sum) will be stored in the
$w_{13}$ SC.

Figure 4: Third stage
a) data communications b) calculus

5.4 Fourth Stage 

The first component is available on the east border of the SA
and will be processed by applying the activation function
$\sigma$. That is, the first component of the new input vector
will be generated:

$$x_{1,2} = \sigma(PS_{1,3}) \qquad (10)$$

The mobile data will be transmitted as in figure 5 and the
$w_{23}$ and $w_{32}$ cells will compute:

$$PS_{2,3} = PS_{2,2} + w_{23} \cdot x_{3,1} \qquad (11)$$

$$PS_{3,2} = PS_{3,1} + w_{32} \cdot x_{2,1} \qquad (12)$$

Figure 3: Second stage
a) data communications b) calculus

Figure 5: Fourth stage - applying the activation function
$\sigma$ and the retransmission towards the main diagonal of
the $x_{1,r+1}$ component

5.5 Fifth Stage 

It is similar to the previous stage; the second component of
the resulting vector will be retransmitted towards the main
diagonal of the SA and the last final sum will be calculated:

$$x_{2,2} = \sigma(PS_{2,3}) \qquad (13)$$

$$PS_{3,3} = PS_{3,2} + w_{33} \cdot x_{3,1} \qquad (14)$$

5.6 Sixth Stage

In the sixth stage the W·x multiplication will be completed,
the activation function $\sigma$ will be applied to the last
component of the resulting vector and the resulting vector
will be placed on the main diagonal of the SA (figure 6):

$$x_{3,2} = \sigma(PS_{3,3}) \qquad (15)$$

Figure 6: Sixth stage
a) data communications b) calculus

At the end of the sixth stage, both the new $x_{r+1}$ vector
components and the old $x_r$ vector components are located on
the main diagonal of the SA. The rules r1, r2 and r3 were
complied with and the SA is ready for a new recurrence. The
presence of both the $x_r$ and $x_{r+1}$ vectors on the main
diagonal of the SA permits the convergence detection of the
Hopfield algorithm (by comparing the two vectors). Comparing
the two components $x_{i,r}$ and $x_{i,r+1}$ gives a local
result $\mathrm{conv}_i$ connected to the convergence
detection. The global convergence signal, for an NxN SA, will
be computed as in (16) and systolically generated (figure 7).

$$\mathrm{Conv}(N) = \prod_{i=1}^{N} \big(x_{i,r} = x_{i,r+1}\big) = \prod_{i=1}^{N} \mathrm{conv}_i \qquad (16)$$

6 FPGA implementation of the 2-D SA

In order to evaluate the systolic implementation of different
NN architectures we also designed the 2-D SA (practically
the SC structure) for other neural algorithms (e.g. the SOFM
algorithm). Using the XILINX software, we also evaluated
the implementability of the 2-D SA in FPGA structures;
we designed and successfully tested the SC block. The
FPGA-based implementation is feasible: e.g. an XC 40250
FPGA chip, which contains 20000 CLBs, can hold a 22x22
SA [5]. A speed-up board with 36 XC 40250 chips can hold
a 132x132 SA (36 x 22^2 = 132^2 SCs).

The major advantage of the FPGA-based implementation is
its dynamic reconfigurability. This implementation may be
used to build speed-up hardware integrated in a host
computer. The speed-up hardware will not execute all
neural algorithms at any one time. In order to keep a very
simple SC structure, it is convenient to group the neural
algorithms into classes (based on the resemblance between
different algorithms). Each class will have a dedicated SC
structure. Therefore, the SC structure will differ from one
class to another, but all SCs will have a common feature:
simplicity and, therefore, high processing speed at hardware
level. Depending on the neural application issued by the
user, the software driver of the neuroprocessor board will
automatically configure the SC structure by programming
the FPGA chips. Therefore, we can obtain maximum
processing speed at hardware level for each neural
algorithm.



Figure 7: Systolic propagation of the convergence signal

In order to increase the performance of the 2-D SA, the
convergence detection for the current recurrence step and
the computation of the output vector for the next recurrence
step are overlapped as systolic operations.
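
The convergence test of equation (16) can be illustrated with a small sketch. It assumes that the local results conv_i are combined by passing a running AND from one diagonal cell to the next, which is one plausible reading of the systolic propagation in figure 7; the actual direction and timing of the signal are not recoverable from the text. The function names `local_conv` and `systolic_conv` are hypothetical.

```python
from typing import List


def local_conv(x_old: List[int], x_new: List[int]) -> List[bool]:
    """conv_i: does component i agree between recurrence steps r and r+1?"""
    return [a == b for a, b in zip(x_old, x_new)]


def systolic_conv(conv: List[bool]) -> List[bool]:
    """Pass a running AND from cell to cell along the diagonal.

    chain[i] is the signal leaving cell i; the last entry is Conv(N).
    """
    chain = []
    carry = True
    for c in conv:
        carry = carry and c
        chain.append(carry)
    return chain


# Example: the second component still changes, so Conv(3) is False.
signal = systolic_conv(local_conv([1, 1, 1], [1, -1, 1]))
```

Because each cell only needs the signal from its predecessor, the product in (16) costs one AND per cell and no global wiring, which fits rule r2.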



Figure 8: The SC implemented in the XC 4025E chip 
7 Performance 

The two-dimensional SA proposed in this paper represents
an implementation that integrates synapses; integration of
synapses facilitates network extension and input/output
communications. Although no practical performance
measurements have been carried out so far, a theoretical
peak performance can be calculated using the two metrics
for neurocomputers:

1. number of connections per second (CPS), for recall 
phase 

2. number of connection updates per second (CUPS), 
for learning phase 

The expected peak performance of the 2-D SA is given by:

$$P = \frac{N^2 \cdot f}{n_{op} \cdot N_{ps}} \cdot U \qquad (17)$$

where: $N^2$ - total number of SCs
f - clock frequency
$n_{op}$ - number of operations required to compute a
connection (recall phase) or to update it
(learning phase)
$N_{ps}$ - number of clock cycles per operation
U - utilization rate of the SA

Considering the Hopfield network ($n_{op} = 1$) implemented
in the 132x132 SA (36 XC 40250 FPGA chips) working at a
clock frequency of f = 100 MHz, with $N_{ps} = 64$ (serial
communication of 64-bit integers), the peak performance
(U = 100%) will be:

$$P = \frac{132^2 \cdot 100 \cdot 10^6}{1 \cdot 64} \cdot 100\% = 27.225 \text{ GCUPS}$$

This value compares well with those reported for
supercomputers or other NN dedicated systems.
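
The arithmetic behind this figure can be checked with a direct transcription of formula (17). The function name `peak_performance` and its parameter names are illustrative only; the values are those quoted in the text.

```python
def peak_performance(n: int, f_hz: float, n_op: int, n_ps: int, u: float = 1.0) -> float:
    """Formula (17): P = N^2 * f / (n_op * N_ps) * U, in connections per second."""
    return (n * n * f_hz) / (n_op * n_ps) * u


# 132x132 SA, 100 MHz clock, one operation per connection,
# 64 clock cycles per operation, full utilization.
p = peak_performance(n=132, f_hz=100e6, n_op=1, n_ps=64, u=1.0)
p_giga = p / 1e9  # about 27.225 giga-connections per second
```

Since P scales with N^2 / N_ps, halving the serial word length or doubling the array side has a much larger effect on the peak figure than a modest change in clock frequency.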

The square W matrix of the Hopfield NN fits well with the
square SA. Other NN architectures (including multilayer
feedforward NNs) have rectangular weight matrices, and it is
necessary to add zero elements in order to obtain a square
matrix that fits the square SA. The utilization rate U can
then be less than 100%, but in most cases it is greater than
70%. The described architecture represents a good
compromise between cost and performance, between
simplicity and regularity of design (thus implying a reduced
cost) and generality of use.
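
The effect of zero padding on U can be made concrete with a small sketch. It rests on an assumption that the paper does not state explicitly: that U is simply the fraction of SCs holding a real (non-padding) weight. The function name `utilization` and the 132x100 layer are hypothetical examples.

```python
def utilization(rows: int, cols: int, n_array: int) -> float:
    """Fraction of the n_array x n_array SCs that hold a real weight
    when a rows x cols weight matrix is zero-padded to fill the array."""
    if rows > n_array or cols > n_array:
        raise ValueError("weight matrix does not fit in the array")
    return (rows * cols) / (n_array * n_array)


u_square = utilization(132, 132, 132)  # Hopfield case: square W, no padding
u_rect = utilization(132, 100, 132)    # hypothetical 132x100 layer, about 0.76
```

Under this assumption a square Hopfield matrix gives U = 1, while a layer whose shorter side is at least 70% of the array side stays above the 70% figure mentioned in the text.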

8 Conclusions

There are a variety of models in the field of NNs, which
differ in structure and in the functions they perform. The
choice of the appropriate NN architecture for a given
application cannot be based on well-defined criteria.
Therefore, research activity in the field of NNs requires
more and more efficient tools. Speed-up hardware
integrated in a host system, such as a workstation, can be a
good compromise between cost and performance. The SA
presented in this paper represents our proposal for this
speed-up hardware. The advantages of the proposed systolic
architecture can be summarized as follows:

• Modularity. It can be detected at different levels: 

■ The SC is built by using three types of functional 
blocks: register, adder and multiplexor. 

■ The VLSI chip contains a mesh-connected matrix of 
identical SCs. 

■ The SA, as a whole, will be built as a
mesh-connected chip array, with interconnections between
nearest neighbor chips.

• Extensibility. The SA can be extended in a very simple
way by adding new FPGA chips to the existing
mesh-connected array of FPGA chips.

• Simplicity. The SC has a very simple structure, based on
a few registers and operators. Therefore, the SC design,
test and performance evaluation processes can be fast
and reliable.

• Reconfigurability. The 2-D SA can be configured by
programming the FPGA chips. A software driver can do
this dynamically, depending on the neural algorithm issued
by the user.

• Fine-grain parallelism. The 2-D SA presented in this
paper integrates synapses and not neurons. This
fine-grain parallelism is much better suited to the
distributed structure of NNs; NNs are highly parallel,
distributed and fault-tolerant systems.